Efficient Extractive Summarization in a Metric Space

Introduction

In this article we explore how to efficiently select a representative subset of items from a larger set. We are given a "universe" (set) of points in a metric space and wish to summarize these points by selecting relatively few representative points. The goal is to choose the summary so that as many points in the universe as possible are as close as possible to a point in the summary. In this article we define the precise problem, mention an approximation algorithm in the literature, and present a much more efficient algorithm that achieves the same results.

Precise Formulation

Let the universe $U$ be a set of items in a metric space with metric $d$. Let the "cost" of an item in $U$ being a certain distance from an item in the summary be given by a positive, strictly increasing function $c$. Let $r$ be a radius such that if an item in $U$ is not within distance $r$ of any point in the summary, it doesn't matter how far away it actually is (for example, we might consider such an item not represented by the summary at all). A summary $S$ is a subset of $U$.

Now we are ready for the main definitions. Given summary $S$ and $u \in U$, the minimum distance to the summary is

$$d_S(u) = \min_{s \in S} d(u, s),$$

with $d_\emptyset(u) = \infty$. The total cost of the summary is:

$$C(S) = \sum_{u \in U} c\big(\min(d_S(u), r)\big).$$

In other words, we sum over all items in $U$ the cost associated with each item's minimum distance to summary $S$, where the minimum distance is capped at the radius $r$. Note that $C$ is strictly decreasing: adding any item to $S$ decreases $C(S)$.
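For concreteness, here is a minimal Python sketch of this computation; the names (`total_cost`, `universe`, and so on) are ours, not from any particular library.

```python
import math

def total_cost(universe, summary, d, cost, r):
    """C(S): the sum over all items of the cost of the capped minimum
    distance to the summary. `d` is the metric, `cost` the positive,
    strictly increasing cost function, and `r` the cap radius."""
    total = 0.0
    for u in universe:
        # Minimum distance to the summary; infinite if the summary is empty.
        d_s = min((d(u, s) for s in summary), default=math.inf)
        total += cost(min(d_s, r))
    return total
```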

Given a target summary size, the objective is to choose $S$ of that size minimizing the total cost $C(S)$. By a reduction from Geometric Set Cover, our problem is NP-hard, so we shouldn't hope for an exact and efficient solution. Our goal is to find an efficient solution that is not too far from the exact one.

Prior Work

The paper [1] presents a polynomial-time approximation algorithm for this problem. The algorithm is greedy, starting with the summary $S = \emptyset$ and repeatedly adding to $S$ the item in $U \setminus S$ that minimizes the resulting total cost, specifically $\arg\min_{u \in U \setminus S} C(S \cup \{u\})$.

This algorithm works by iteratively computing $C(S \cup \{u\})$ for each $u \in U \setminus S$, which requires iterating over $u \in U \setminus S$ and, for each such $u$, iterating over all of $U$ in the computation of $C(S \cup \{u\})$ (maintaining $d_S(v)$ for each $v \in U$ between additions, so that each evaluation of $C$ takes linear time). Thus each addition to $S$ requires $O(|U|^2)$ computation, and if $n = |U|$ and the final summary has size $k$, the algorithm runs in time $O(kn^2)$.
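As a sketch, the greedy baseline can be transcribed directly, reusing the `total_cost` helper above. Note that this naive transcription recomputes each candidate's cost from scratch rather than maintaining $d_S$, so it is even slower than the $O(kn^2)$ bound.

```python
def greedy_summary(universe, k, d, cost, r):
    """Greedy algorithm of [1]: repeatedly add the item that minimizes
    the total cost of the enlarged summary."""
    summary = set()
    for _ in range(k):
        candidates = [u for u in universe if u not in summary]
        best = min(candidates,
                   key=lambda u: total_cost(universe, summary | {u}, d, cost, r))
        summary.add(best)
    return summary
```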

It's worth noting that this paper does not consider a radius $r$; in other words, it is as if $r = \infty$ in our formulation. We add $r$ as a mild modification to admit an efficient algorithm. In most applications far-apart points are not remotely similar, so adding $r$ does not materially change the problem.

Our Approach

Overview

Given an item $u \in U \setminus S$, let

$$\Delta(u) = C(S) - C(S \cup \{u\}),$$

the decrease in total cost from adding $u$ to the summary. Like the algorithm in [1], at each step our algorithm selects the element whose addition most reduces the total cost; that is, we add $\arg\max_{u \in U \setminus S} \Delta(u)$ to $S$. To efficiently find this element, we maintain all $u \in U \setminus S$ in a set sorted by $\Delta(u)$. The sorted set allows us to immediately select the best element; however, upon selecting it, we need to recompute $\Delta(u)$ for the remaining $u$. Our observation is that $\Delta(u)$ changes for only specific $u$, that there is a simple criterion to determine these $u$, and that there is an efficient algorithm to obtain them.

Thus at a high level our algorithm is to start with $S = \emptyset$ and repeat the following steps $k$ times, where $k$ is the desired size of the summary (or until some other stopping condition is reached):

  1. Remove $u^* = \arg\max_u \Delta(u)$ from the sorted set.
  2. Identify the items $u$ for which $\Delta(u)$ changes when $u^*$ is added to $S$.
  3. For each $u$ identified above, compute the new $\Delta(u)$ and re-sort it in the sorted set.
  4. Set $S \leftarrow S \cup \{u^*\}$.
Steps 1 and 4 are trivial. The key observation for implementing steps 2 and 3 efficiently is that step 2 only requires examining items that are close to $u^*$, and step 3 only requires examining items that are close to those identified in step 2. Details follow.

Locality

The algorithm maintains $d_S(u)$ for all $u \in U$. Let the "neighborhood" of an item $u$ be $N(u) = \{v \in U : d(u, v) < r\}$. Note that the term for $v$ in $\Delta(u)$ is zero unless $d(u, v) < \min(d_S(v), r)$, so that $\Delta(u)$ depends only on the neighbors of $u$. Thus $\Delta(u) > 0$ only if some $v \in N(u)$ satisfies $d(u, v) < \min(d_S(v), r)$. Furthermore, the only items $v$ for which $\Delta(v)$ changes when an item is added to $S$ are those that are close to items $w$ for which $d_S(w)$ decreases; this statement is made precise in, and follows from, the lemma below.
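To make the locality concrete, here is a sketch of computing $\Delta(u)$ from $u$'s neighborhood alone. The `neighbors` mapping is a hypothetical stand-in for a spatial-index query; we assume `neighbors[u]` yields $(w, d(u, w))$ pairs with $d(u, w) < r$, including $u$ itself at distance $0$.

```python
def delta(u, d_s, neighbors, cost, r):
    """Decrease in total cost C from adding u to the summary.
    Items outside N(u) contribute nothing: u cannot improve their
    capped distance to the summary."""
    total = 0.0
    for w, d_uw in neighbors[u]:
        old = min(d_s[w], r)   # w's current capped distance to the summary
        new = min(old, d_uw)   # w's capped distance if u were added
        if new < old:
            total += cost(old) - cost(new)
    return total
```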

These locality observations allow us to precisely specify steps 2 and 3 of the high level algorithm above. Specifically:

Lemma

Let $u^* \in U \setminus S$ and $S' = S \cup \{u^*\}$, and write $\Delta_S$ and $\Delta_{S'}$ for $\Delta$ computed with respect to $S$ and $S'$. For any $v \in U \setminus S'$, $\Delta_{S'}(v) \neq \Delta_S(v)$ if and only if there exists $w \in N(v)$ such that

$$d(w, u^*) < \min(d_S(w), r) \quad \text{and} \quad d(w, v) < \min(d_S(w), r). \tag{$*$}$$

Proof

$$\Delta_S(v) = \sum_{w \in N(v)} \Big[ c\big(\min(d_S(w), r)\big) - c\big(\min(d_S(w), d(w, v), r)\big) \Big]$$

since terms for all other $w$ are $0$. Adding $u^*$ replaces $d_S(w)$ with $d_{S'}(w) = \min(d_S(w), d(w, u^*))$ in each term. For any $w$ in the sum, the term for $w$ in $\Delta_{S'}(v)$ cannot be larger than in $\Delta_S(v)$, which can be seen by separately considering the cases $d(w, u^*) \geq \min(d_S(w), r)$, $d(w, v) \leq d(w, u^*) < \min(d_S(w), r)$, and $d(w, u^*) < \min(d(w, v), d_S(w), r)$. Therefore, because $c$ is strictly increasing and no term can increase, $\Delta_{S'}(v) \neq \Delta_S(v)$ iff there exists $w \in N(v)$ whose term changes; it remains to show that $w$'s term changes iff ($*$) holds.

If $d(w, u^*) \geq \min(d_S(w), r)$ then $\min(d_{S'}(w), r) = \min(d_S(w), r)$ and the term is unchanged; if $d(w, v) \geq \min(d_S(w), r)$ then the term is zero under both $S$ and $S'$. Either way, when ($*$) fails, $w$'s term in $\Delta(v)$ is unchanged.

Conversely, if $d(w, u^*) < \min(d_S(w), r)$ and $d(w, v) < \min(d_S(w), r)$ then $w$'s term in $\Delta_{S'}(v)$ is $c(d(w, u^*)) - c\big(\min(d(w, u^*), d(w, v))\big)$, and so regardless of which of the two distances achieves the minimum, it is strictly smaller than $w$'s term $c(\min(d_S(w), r)) - c(d(w, v))$ in $\Delta_S(v)$. Therefore $\Delta_{S'}(v) < \Delta_S(v)$. $\square$

Complete Algorithm

Store $U$ in a spatial index that can be searched for neighborhoods $N(u)$. Start with $S = \emptyset$, set $d_S(u) = \infty$ and compute $\Delta(u)$ for each $u \in U$. Store the items in a priority queue or sorted set ordered by $\Delta(u)$ descending. Then repeat the following steps as many times as desired:
  1. Remove the first element of the priority queue or sorted set and call it $u^*$.
  2. Compute $W = \{w \in N(u^*) : d(w, u^*) < \min(d_S(w), r)\}$, the items whose distance to the summary shrinks.
  3. Compute $V = \bigcup_{w \in W} \{v \in N(w) : d(w, v) < \min(d_S(w), r)\}$, the items whose $\Delta$ changes, per the lemma.
  4. For all $w \in W$, set $d_S(w) \leftarrow d(w, u^*)$.
  5. For all $v \in V$, set $\Delta(v) \leftarrow \sum_{w \in N(v)} \big[c(\min(d_S(w), r)) - c(\min(d_S(w), d(w, v), r))\big]$ and re-sort it in the priority queue or sorted set.
  6. Set $S \leftarrow S \cup \{u^*\}$.
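The following Python sketch ties these steps together under simplifying assumptions: neighborhoods are pre-computed into a dict `neighbors` (standing in for the spatial index), and a binary heap with lazy deletion stands in for the priority queue or sorted set, since stale entries are cheaper to skip on pop than to re-sort in place. All names here are ours.

```python
import heapq
import itertools
import math

def summarize(universe, k, neighbors, cost, r):
    """Select up to k summary items. `neighbors[u]` lists (w, d(u, w))
    pairs with d(u, w) < r, including (u, 0) itself."""
    d_s = {u: math.inf for u in universe}      # distance to summary, per item

    def delta(v):
        # Decrease in C from adding v, computed over v's neighborhood only.
        return sum(cost(min(d_s[w], r)) - cost(min(d_s[w], d_vw, r))
                   for w, d_vw in neighbors[v])

    current = {u: delta(u) for u in universe}
    counter = itertools.count()                # tie-breaker for heap entries
    heap = [(-current[u], next(counter), u) for u in universe]
    heapq.heapify(heap)                        # max-heap via negated deltas
    summary, chosen = [], set()

    while len(summary) < k and heap:
        neg, _, u_star = heapq.heappop(heap)   # step 1
        if u_star in chosen or -neg != current[u_star]:
            continue                           # stale entry; skip it
        # Step 2: W, the neighbors whose distance to the summary shrinks.
        improved = [(w, d) for w, d in neighbors[u_star] if d < min(d_s[w], r)]
        # Step 3: V, the items whose delta changes (by the lemma).
        affected = {v for w, _ in improved for v, d_wv in neighbors[w]
                    if d_wv < min(d_s[w], r)}
        for w, d in improved:                  # step 4
            d_s[w] = d
        for v in affected:                     # step 5 (lazy re-sort)
            if v not in chosen and v != u_star:
                current[v] = delta(v)
                heapq.heappush(heap, (-current[v], next(counter), v))
        chosen.add(u_star)                     # step 6
        summary.append(u_star)
    return summary
```

For small inputs the `neighbors` map can be built by brute force over all pairs; in practice a k-d tree, ball tree, or similar index would supply the neighborhood queries.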

Complexity

Let $n = |U|$, $k$ be the number of iterations, and $m = \max_{u \in U} |N(u)|$. Also let $Q$ upper bound the time to query the spatial index and to access the priority queue or sorted set. Then this algorithm runs in time $O\big((n + km^2)(m + Q)\big)$. The space complexity is dominated by the spatial index: if all neighborhoods are pre-computed then the required space is $O(nm)$; otherwise typically it is $O(n)$.
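For concreteness, the bound decomposes into initialization plus $k$ iterations of the steps above, with step 5 dominating each iteration (up to $m^2$ recomputations of $\Delta$ at $O(m + Q)$ each):

$$\underbrace{n(m + Q)}_{\text{initialize}} \;+\; k\Big(\underbrace{(m + Q)}_{\text{step 2}} + \underbrace{m(m + Q)}_{\text{step 3}} + \underbrace{m^2(m + Q)}_{\text{step 5}}\Big) \;=\; O\big((n + km^2)(m + Q)\big).$$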

Given the running time's cubic dependence on $m$, the performance of this algorithm critically depends on neighborhoods being small. Thus $r$ should be chosen so that $m \ll n$ and ideally $m = O(1)$. Efficient spatial indexing and sorted data structures typically allow $Q = O(\log n)$. If both $m = O(1)$ and $Q = O(\log n)$ then the running time is $O\big((n + k)\log n\big)$, which equals $O(n \log n)$ if $k = O(n)$, as it typically would be. Compare to the original algorithm's time $O(kn^2)$.

Conclusion

We have demonstrated a simple and efficient algorithm for extractive summarization in a metric space. Under mild assumptions, it runs far faster than the original algorithm presented in [1].

Code

You may find a reference implementation of the above algorithm here.

References

  1. H. Lin, J. Bilmes and S. Xie, "Graph-based submodular selection for extractive summarization," 2009 IEEE Workshop on Automatic Speech Recognition & Understanding, Merano, 2009, pp. 381-386.